Investigating Documents#
Web App#
When you click ‘Documents’ on the navigation bar, you will be presented with this page. It is mainly for investigating differences between documents and topic distributions in a corpus.
Note
With large corpora this page is by far the slowest to start up, so I recommend disabling it unless you have a special interest in the individual documents in your corpus. Document representations are high-dimensional and there are usually many documents, so calculating 2D UMAP projections for such data is slow and tedious.
You can disable the page in visualize():
topicwizard.visualize(corpus=corpus, pipeline=pipeline, exclude_pages=["documents"])
Document map#
On the left you will see a plot showing all the documents, also known as the document map.
Document positions are calculated by taking the document embeddings created by the vectorizer and reducing them to two dimensions with UMAP. You can zoom in on the graph by dragging your cursor to enclose a selection area.
Selecting documents#
You can select documents either by clicking on them on the document map or by searching for them in the “Select document” field above the map.
Wordcloud#
The most used words in the document are displayed on a wordcloud.
The wordcloud is draggable with the cursor and zoomable by scrolling.
Topic use#
Use of topics in the document is displayed with a pie chart.
Topic Timeline#
You will also see a timeline which visualizes the use of topics over time in the document.
You can remove topics from the plot by clicking on them on the right. You can isolate an individual topic by double-clicking on it.
Topic use is calculated with rolling windows of words over the document. You can adjust window size by dragging the slider on top.
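If you are curious what a rolling window over a document looks like, here is a minimal sketch of the idea. It is not topicwizard's actual implementation: the keyword-based `TOPIC_KEYWORDS` "model" and the `topic_use` and `rolling_topic_use` helpers are hypothetical names made up for illustration, and a real pipeline would run each window through the vectorizer and topic model instead.

```python
from collections import Counter

# Hypothetical toy "topic model": each topic is just a set of keywords.
# A real pipeline would vectorize the window and call the topic model instead.
TOPIC_KEYWORDS = {
    "sports": {"ball", "team", "goal"},
    "cooking": {"oven", "salt", "recipe"},
}


def topic_use(words):
    """Return each topic's share of the topic keywords found in `words`."""
    counts = Counter()
    for word in words:
        for topic, keywords in TOPIC_KEYWORDS.items():
            if word in keywords:
                counts[topic] += 1
    total = sum(counts.values())
    # With no keyword hits, report zero use for every topic.
    return {
        topic: (counts[topic] / total if total else 0.0)
        for topic in TOPIC_KEYWORDS
    }


def rolling_topic_use(text, window_size):
    """Compute topic use for every window of `window_size` consecutive words."""
    words = text.lower().split()
    # Documents shorter than the window yield a single window with all words.
    n_windows = max(len(words) - window_size + 1, 1)
    return [
        topic_use(words[start:start + window_size])
        for start in range(n_windows)
    ]


document = "the team scored a goal then we found a recipe with salt for the oven"
timeline = rolling_topic_use(document, window_size=5)
```

Each entry in `timeline` corresponds to one position of the window, so plotting the entries in order gives a curve per topic. A larger `window_size` makes windows overlap more topics at once and smooths the curves, which is the effect you get when you drag the slider.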
Self-Contained Plots#
It might be overkill for you to display the entire page, and you might want static HTML plots instead of the entire application running. This can be particularly useful for reports with DataPane or Jupyter Notebooks.
Document Map#
You can display a map of documents as a self-contained plot. This can be advantageous when you want to see how your topic model maps onto embedding space or see how different documents relate to each other in the corpus.
This plot is not entirely identical to the one in the app, as documents cannot be selected or searched for.
Different topics are clearly outlined with discrete colors.
You can also choose whether to use the representations produced by the vectorizer or those produced by the topic model for visualization. This can be particularly useful if you use a topic model whose representations are not based on bag-of-words representations, like BERTopic for example (stay tuned for another fun package btw :)).
from topicwizard.figures import document_map
# Term-based representations, aka. vectorizer output
document_map(corpus=texts, pipeline=pipeline, representation="term")